PROBABILISTIC METHODOLOGY FOR RECORD LINKAGE DETERMINING ROBUSTNESS OF WEIGHTS

Authors

  • Krista P. Jensen
  • John S Lawson
Abstract

Over time, people around the world have developed a desire to research their ancestral lineage. Many resources have been identified to aid an individual in genealogical research. In the United States, one of the greatest resources for researching genealogy is census records. Census records allow a genealogical researcher to track individuals over time, broadening the scope of information one can acquire about an individual.

Dr. Halbert Dunn first presented the concept of record linkage in 1946 to describe the process of joining two separate pieces of information for a particular individual or family [Dunn 1946]. Later, Fellegi and Sunter [1969] built upon Dunn’s foundations by establishing a probabilistic mathematical approach to record linkage. Probabilistic methods for record linkage have been developed to mimic the decision process of genealogists and researchers. An automated probabilistic approach allows the researcher to conduct many different types of searches within seconds. Following an automated search, a list of record matches (links) and potential matches (possible links) can be produced, together with the information needed to explore each potential record pair further. This enables a researcher to compile large numbers of records in a fraction of the time manual processing would take.

Probabilistic methods have been applied to determine the feasibility of linking persons across multiple census years. With a set of known weights to use in the record linkage process, one would eliminate the need to examine a large number of records manually. This paper uses probabilistic methods to link census records from the 1910 and 1920 census indices to illustrate the benefits of an automated record linkage approach.

CENSUS INDICES

Since before 1850, census records have captured individuals' demographic and personal information. Census indices contain a subset of the information found on a census page. In addition to omitting some information, census indices only include records for the head of household and for individuals whose last name differs from that of the head of household [Szucs 2001]. Because of the limited information available in a census index, the defining demographics to be used in record linkage are likewise limited. The subset of information found in a census index is as follows: surname, given name (sometimes a middle name or initial is present in the field for given name), age at the time of census, gender, race, country of origin, state of residence, county of residence, district of residence, and information about the census page on which the record is located.

When linking census records from any time period, it is important to account for discrepancies between censuses and for failings inherent in the censuses. Because records from 1910 and 1920 have been used herein, issues relating to these census years will be presented. In 1918, at the end of World War I, many Eastern European boundaries were realigned, changing the “place of origin” for many immigrants in the United States. For instance, an individual listing their “place of origin” as Prussia in 1910 would list Germany in 1920. Though the country of origin is typically stated, many instances arise where a region or city is given instead of a country, such as Bavaria, a major region in Southern Germany. The most prominent concern in using census records is the reliability of the reported age. Many individuals were secretive about their age or were unaware of their actual birth date.
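As a rough illustration of how a Fellegi and Sunter style weighting could be applied to the census index fields described above, the following Python sketch compares a 1910 record with a 1920 record and sums log-likelihood-ratio agreement weights. It is not the authors' implementation: the m- and u-probabilities, the link thresholds, the two-year age tolerance, the place-of-origin alias table, and all function and field names are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical (m, u) probabilities per field:
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
M_U = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "gender": (0.98, 0.50),
    "origin": (0.85, 0.10),
    "age": (0.80, 0.10),
}

# Hypothetical alias table handling boundary changes and regional names.
ORIGIN_ALIASES = {"prussia": "germany", "bavaria": "germany"}


def normalize_origin(place):
    """Map a reported place of origin to a common country label."""
    key = place.strip().lower()
    return ORIGIN_ALIASES.get(key, key)


def agreement(rec_1910, rec_1920, age_tolerance=2):
    """Return a field -> bool agreement pattern for one record pair."""
    aged = rec_1920["age"] - rec_1910["age"]
    return {
        "surname": rec_1910["surname"].lower() == rec_1920["surname"].lower(),
        "given_name": rec_1910["given_name"].lower() == rec_1920["given_name"].lower(),
        "gender": rec_1910["gender"] == rec_1920["gender"],
        "origin": normalize_origin(rec_1910["origin"]) == normalize_origin(rec_1920["origin"]),
        # Ten years separate the censuses; allow some slack for misreported ages.
        "age": abs(aged - 10) <= age_tolerance,
    }


def match_weight(pattern):
    """Sum log2 agreement or disagreement weights over all fields."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if pattern[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Three-way decision using two hypothetical thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


rec_a = {"surname": "Schmidt", "given_name": "Karl", "gender": "M",
         "origin": "Prussia", "age": 34}
rec_b = {"surname": "Schmidt", "given_name": "Karl", "gender": "M",
         "origin": "Germany", "age": 45}
rec_c = {"surname": "Weber", "given_name": "Otto", "gender": "M",
         "origin": "Germany", "age": 44}

weight_ab = match_weight(agreement(rec_a, rec_b))
weight_ac = match_weight(agreement(rec_a, rec_c))
print(classify(weight_ab), round(weight_ab, 2))
print(classify(weight_ac), round(weight_ac, 2))
```

In this sketch the Prussia and Germany entries are treated as agreeing, and a reported age gain of eleven years still counts as agreement. Pairs that fall between the two thresholds are the "potential matches" a researcher would review by hand.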


Similar resources

Probabilistic Linkage of Persian Record with Missing Data

Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates within a data set. The i...


Probabilistic record linkage

Studies involving the use of probabilistic record linkage are becoming increasingly common. However, the methods underpinning probabilistic record linkage are not widely taught or understood, and therefore these studies can appear to be a 'black box' research tool. In this article, we aim to describe the process of probabilistic record linkage through a simple exemplar. We first introduce the c...


[Accuracy of the probabilistic record linkage methodology to ascertain deaths in survival studies].

Probabilistic record linkage methodology has been increasingly used to ascertain outcomes in cohort studies. However, only a few studies have evaluated its accuracy. The aim of this study was to evaluate the accuracy of probabilistic record linkage methodology to ascertain deaths in a cohort of 250 elderly people hospitalized for fractures caused by falls. The vital status of cohort members was...


Data Preparation for Biomedical Knowledge Domain Visualization: A Probabilistic Record Linkage and Information Fusion Approach to Citation Data

Marie B Synnestvedt; Xia Lin, Ph.D. This thesis presents a methodology of data preparation with probabilistic record linkage and information fusion for improving and enriching information visualizations of biomedical citation data. The problem of record l...


G-LINK: A Probabilistic Record Linkage System

At Statistics Canada, matching data without unique identifiers is a common practice. The probabilistic record linkage method developed by Ivan Fellegi and Allan Sunter is the primary method recommended by Statistics Canada for this type of matching. In recent decades, work began to generalize the Fellegi–Sunter algorithm in order to offer our community the opportunity to use this methodology ...




Publication date: 2005